Conversation

@amametjanov (Member)

Also remove (slow) profiling calls.

[BFB]


The hangs are not reproducible with a plain-vanilla --compset F2010-SCREAMv1 --res ne256pg2_ne256pg2 case: the caar-loop dp3d limiter needs to be triggered. If triggered, the hang can occur within Lm1, so at least one model-month needs to be run.

@amametjanov self-assigned this Nov 2, 2025
@amametjanov added the bug fix PR, BFB (PR leaves answers BFB), HOMME, and Aurora (ALCF machine Aurora) labels Nov 2, 2025
@amametjanov (Member Author)

Testing on Aurora with run.ne256pg2_ne256pg2.F2010-SCREAMv1.sh:

  • base: hang after ~16 model-days, on YYYYMMDD 20180114
  • test: continues past the hang at ~1.41 sypd

@amametjanov (Member Author) left a comment


It'd be good to remove the two printf calls from the GPU kernel.

Kokkos::printf("WARNING:CAAR: dp3d too small. k=%d, dp3d(k)=%f, dp0=%f \n",
k+1,dp_as_real(k),dp0_as_real(k));
}
#endif
@amametjanov (Member Author)


Also, I'd like to disable these prints inside GPU kernels and print them only in debug mode, i.e.

+#ifndef NDEBUG
 #ifndef HOMMEXX_BFB_TESTING
         if(diff_as_real(k) < 0){
           Kokkos::printf("WARNING:CAAR: dp3d too small. k=%d, dp3d(k)=%f, dp0=%f \n",
                          k+1,dp_as_real(k),dp0_as_real(k));
         }
+#endif
 #endif  

Is that okay?

@mahf708 (Contributor), Nov 3, 2025


I think removing these from the kernels is a good idea. But we do want to give the user some warning about this: these warnings are very helpful when debugging instability issues (still a relatively common problem in F2010-SCREAMv1-variety setups). I think @mt5555 should also be consulted before removing these warnings completely.

One idea might be to capture/set a flag inside the kernel; then outside the kernel, if it's true (i.e., the condition was triggered), we print a statement. The specifics (levels and values) are helpful, but we could live with a simple warning (just saying the threshold was triggered) without the specifics if we must.
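
A hedged, minimal sketch of this flag-capture idea (names like diff and warn, and the level count, are hypothetical stand-ins, not the HOMME code): the kernel records the condition in a 0-d device flag, and the host checks it and prints.

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int nlev = 72;                       // assumed number of levels
    Kokkos::View<double*> diff("diff", nlev);  // stand-in for diff_as_real
    Kokkos::View<int> warn("warn");            // 0-d device flag, zero-initialized
    Kokkos::parallel_for("check_dp3d", nlev, KOKKOS_LAMBDA(const int k) {
      if (diff(k) < 0) Kokkos::atomic_store(&warn(), 1);  // condition triggered
    });
    // Copy the flag to the host and print the warning outside the kernel.
    auto warn_h = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), warn);
    if (warn_h() != 0)
      std::printf("WARNING:CAAR: dp3d below threshold this step\n");
  }
  Kokkos::finalize();
  return 0;
}

This loses the per-level specifics, as noted above, but keeps the kernel itself free of printf.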

@bartgol (Contributor), Nov 3, 2025


A || reduce is definitely a reasonable strategy. It is already implemented in the CaarFunctor, where we want to print an extensive picture of the state before crashing, if something is amiss in the functor execution.

Edit: I would also be against making the code print in debug mode only. For relatively large cases, people are not so likely to run the code in debug mode, as it would be remarkably slower (Kokkos is MUCH slower in debug mode), making debugging painful.
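
For reference, a hedged sketch of what such a team-level ||-reduce can look like (hypothetical names and sizes, not the actual CaarFunctor code): reduce a "something is amiss" flag over the team, then let one thread per team print.

#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<>;
    using member_t = policy_t::member_type;
    const int nelem = 4, nlev = 72;            // assumed sizes
    Kokkos::View<double**> dp3d("dp3d", nelem, nlev);
    Kokkos::parallel_for(policy_t(nelem, Kokkos::AUTO),
                         KOKKOS_LAMBDA(const member_t& team) {
      const int ie = team.league_rank();
      int bad = 0;
      // ||-reduce: true if any level in this element trips the threshold.
      Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team, nlev),
                              [&](const int k, int& flag) {
        if (dp3d(ie, k) < 0) flag = 1;
      }, Kokkos::LOr<int>(bad));
      // One print per team instead of one per offending level.
      Kokkos::single(Kokkos::PerTeam(team), [&]() {
        if (bad) Kokkos::printf("WARNING: dp3d too small in element %d\n", ie);
      });
    });
  }
  Kokkos::finalize();
  return 0;
}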

vtheta_dp(ilev) *= dp(ilev);
});
} //end of min_diff < 0
kv.team_barrier();
@amametjanov (Member Author), Nov 2, 2025


Okay to disable this print too, i.e. print only in debug mode?

         for (int ivec=0; ivec<VECTOR_SIZE; ++ivec) {
           if ( (vtheta_dp(ilev)[ivec] - m_vtheta_thresh*dp(ilev)[ivec]) < 0) {
+#ifndef NDEBUG
 #ifndef HOMMEXX_BFB_TESTING
             Kokkos::printf("WARNING:CAAR: k=%d,theta(k)=%f<%f=th_thresh, applying limiter \n",
                            ilev*VECTOR_SIZE+ivec+1,vtheta_dp(ilev)[ivec]/dp(ilev)[ivec],m_vtheta_thresh);
+#endif
 #endif
              vtheta_dp(ilev)[ivec]=m_vtheta_thresh*dp(ilev)[ivec];
           }

#endif
result = result<=diff_as_real(k) ? result : diff_as_real(k);
}, reducer);
kv.team_barrier();
@amametjanov (Member Author)


This barrier is needed to finish updating min_diff, which is used below.

Contributor


As I said in the previous comment, I think barriers inside a TeamThreadRange are conceptually wrong. They work in our case b/c our team sizes are <=16 and always a power of 2, and NP*NP=16. But if this were to change in the future, I think these barriers may cause deadlocks. Barriers should be outside TeamThreadRange loops.

We can help fix this if you want. We can see if Oksana has some commit buried somewhere that addresses these barriers.
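
A hedged sketch of the restructuring meant here, with hypothetical sizes (not the HOMME loops): split the work into two top-level TeamThreadRange loops and put the team_barrier between them, where every team thread reaches it regardless of team size.

#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<>;
    using member_t = policy_t::member_type;
    const int nelem = 4, np2 = 16;             // assumed sizes
    Kokkos::View<double**> a("a", nelem, np2), b("b", nelem, np2);
    Kokkos::parallel_for(policy_t(nelem, Kokkos::AUTO),
                         KOKKOS_LAMBDA(const member_t& team) {
      const int ie = team.league_rank();
      // Phase 1: every team thread writes part of a(ie,:).
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, np2),
                           [&](const int i) { a(ie, i) = i; });
      team.team_barrier();  // top level: safe for any team size
      // Phase 2: read entries of a(ie,:) written by other threads.
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, np2),
                           [&](const int i) { b(ie, i) = a(ie, np2 - 1 - i); });
    });
  }
  Kokkos::finalize();
  return 0;
}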

Contributor


I second @bartgol, these should not be here. I'm curious if parallel_reduce(team) implies a sync over team threads though, similar to how parallel_reduce(exec_space) implies a fence over the execution space.

@ambrad (Member), Nov 4, 2025


Some background: we started adding these due to GPU race conditions on various target machines, knowing they weren't quite right for arbitrary threading configurations. The commit I linked does the right thing by keeping the syncs but rewriting loops so syncs occur at the top level.

Comment on lines -171 to -172
kv.team_barrier();

@amametjanov (Member Author)


Moving this barrier out of the if (min_diff<0) { branch, to avoid possible stalls waiting for threads in the else branch.
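
A hedged before/after sketch of that change (hypothetical data; the branch condition is uniform across the team here, as in the real code): the barrier moves from inside the branch to just after it, so all team threads arrive at the same barrier on either path.

#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<>;
    using member_t = policy_t::member_type;
    const int nelem = 4, np2 = 16;             // assumed sizes
    Kokkos::View<double*> min_diff("min_diff", nelem);
    Kokkos::View<double**> v("v", nelem, np2);
    Kokkos::parallel_for(policy_t(nelem, Kokkos::AUTO),
                         KOKKOS_LAMBDA(const member_t& team) {
      const int ie = team.league_rank();
      if (min_diff(ie) < 0) {
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, np2),
                             [&](const int i) { v(ie, i) = 0.0; });
        // (was: team.team_barrier() here, inside the branch)
      }
      team.team_barrier();  // moved out: reached on both paths
      // ...subsequent reads of v(ie,:) are now safe...
    });
  }
  Kokkos::finalize();
  return 0;
}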

Member


This is a tricky issue. It's definitely wrong to have a team_barrier in a conditional like this, since it can lead to deadlock. But if NUM_LEV != NUM_PHYSICAL_LEV, I think there can be a race condition in the parallel_for-parallel_reduce sequence that this team_barrier is separating. On the GPU, NUM_LEV == NUM_PHYSICAL_LEV, and on the CPU, we don't actually have threading here in practice, so this point is not relevant in practice.

Nonetheless, I had a branch going a long time ago that fixed this type of issue in a few spots in HOMME, but I ended up getting distracted by other machine issues and never got this branch in. It might be worth consulting this commit to see how to keep this team_barrier. (The ComposeTransport change is no longer needed, but the three others are still relevant, although again only in threading/vectorization configurations we don't run in practice.)

The changes in this PR are fine if keeping them is preferred.

Contributor


Andrew, I think we should get your commit in (at least, the part that fixes the team barrier). While I agree that in practice it doesn't make a difference now, we don't know if our threading schemes will change five years from now, and catching this kind of bug would take time. Better to put in the correct code now, and be safe later.

Contributor


Btw, I thought @oksanaguba had fixed this as well in one of her branches a while ago, no?

Member


> I think we should get your commit in

Ok. For clarity, I'm not on this project this year, so someone else would need to bring in the changes.

vtheta_dp(ilev) *= dp(ilev);
});
} //end of min_diff < 0
kv.team_barrier();
@amametjanov (Member Author), Nov 2, 2025


Adding kv.team_barrier() here to finish updating vtheta_dp, which is read below.

@mahf708 requested a review from mt5555 on November 3, 2025
@ambrad (Member) left a comment


This is an improvement and fixes an important problem, so I see no reason to make further changes in this PR despite the good ideas for more changes. Those can go in a separate PR. Just my opinion: don't necessarily take this approval as definitive.

diff(ilev) = (dp(ilev) - m_dp3d_thresh*dp0(ilev))*spheremp;
});

kv.team_barrier();
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, I think this team barrier is also wrong, as it sits inside a TeamThreadRange. While in practice it's ok, since our team sizes always divide NP*NP, if someone somehow launched with a team_size of, say, 10, I think this may cause a deadlock.

Contributor


I do believe @oksanaguba had run into these barriers at some point, and fixed them. Perhaps the fixes never made it to master. Oksana, do you recall anything like that? Do you know which branch may have all fixes for this stuff? We can of course fix them here, but I want to make sure we don't leave any issue behind, while we're at it...

Member


Luca, I agree about the barriers. The commit I linked handles that issue, as well.

Contributor


@amametjanov if it's ok with you, I can extract the relevant part from Andrew's commit, and push to this branch.

@amametjanov (Member Author)


Yes, please feel free to update as needed!

